Regression problems involve the prediction of a continuous, numeric value from a set of characteristics.

In this example, we'll build a model to predict house prices from characteristics like the number of rooms and the crime rate at the house location.

Reading data

We'll be using the pandas package to read data.

Pandas is an open source library that can be used to read formatted data files into tabular structures that can be processed by python scripts.



In [1]:

    
# Make sure you have a working installation of pandas by executing this cell
import pandas as pd

In this exercise, we'll use the Boston Housing dataset to predict house prices from characteristics like the number of rooms and distance to employment centers.



In [2]:

    
# Read 'datasets/boston.csv' with pandas
boston_housing_data = pd.read_csv('../datasets/boston.csv')

Pandas allows reading our data from different file formats and sources. See this link for a list of supported operations.

The head() method prints five entries by default. It can receive an optional argument to specify how many lines must be printed, like boston.head(n=10)



In [3]:

    
# Use the head() method to print the first five entries in the dataset
boston_housing_data.head()

The info() method shows several details about the dataset, like how many entries there are in the dataset, what features are present, what's the data type of each feature and if there are any missing data in a feature.



In [4]:

    
# Use the info() method to print information about the dataset
boston_housing_data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB

The describe() method only shows the summary statistics for columns of numeric types. If a column contains strings, for instance, it won't be able to calculate these descriptors.



In [5]:

    
# Use the describe() method to print summary statistics of the dataset
boston_housing_data.describe()

Visualizing data

After reading our data into a pandas DataFrame and getting a broader view of the dataset, we can build charts to visualize tha "shape" of the data.

We'll use python's Matplotlib library to create these charts.

An example

Suppose you're given the following information about four datasets:



In [6]:

    
datasets = pd.read_csv('../datasets/anscombe.csv')

for i in range(1, 5):
    dataset = datasets[datasets.Source == 1]
    print('Dataset {} (X, Y) mean: {}'.format(i, (dataset.x.mean(), dataset.y.mean())))

print('\n')
for i in range(1, 5):
    dataset = datasets[datasets.Source == 1]
    print('Dataset {} (X, Y) std deviation: {}'.format(i, (dataset.x.std(), dataset.y.std())))

print('\n')
for i in range(1, 5):
    dataset = datasets[datasets.Source == 1]
    print('Dataset {} correlation between X and Y: {}'.format(i, dataset.x.corr(dataset.y)))









    



Dataset 1 (X, Y) mean: (9.0, 7.500909090909093)
Dataset 2 (X, Y) mean: (9.0, 7.500909090909093)
Dataset 3 (X, Y) mean: (9.0, 7.500909090909093)
Dataset 4 (X, Y) mean: (9.0, 7.500909090909093)


Dataset 1 (X, Y) std deviation: (3.3166247903554, 2.031568135925815)
Dataset 2 (X, Y) std deviation: (3.3166247903554, 2.031568135925815)
Dataset 3 (X, Y) std deviation: (3.3166247903554, 2.031568135925815)
Dataset 4 (X, Y) std deviation: (3.3166247903554, 2.031568135925815)


Dataset 1 correlation between X and Y: 0.81642051634484
Dataset 2 correlation between X and Y: 0.81642051634484
Dataset 3 correlation between X and Y: 0.81642051634484
Dataset 4 correlation between X and Y: 0.81642051634484

They all have roughly the same mean, standard deviations and correlation. How similar are they?

This dataset is known as the Anscombe's Quartet and it's used to illustrate how tricky it can be to trust only summary statistics to characterize a dataset.



In [7]:

    
# Importing matplotlib for the first time may show a 
# warning message about the system's fonts
import matplotlib.pyplot as plt
# This line makes the graphs appear as cell outputs rather than in a separate window or file.
%matplotlib inline



In [8]:

    
# Extract the house prices and average number of rooms to two separate variables
prices = boston_housing_data.medv
rooms = boston_housing_data.rm

# Create a scatterplot of these two properties using plt.scatter()
plt.scatter(rooms, prices)
# Specify labels for the X and Y axis
plt.xlabel('Number of rooms')
plt.ylabel('House price')
# Show graph
plt.show()



In [9]:

    
# Extract the house prices and average number of rooms to two separate variables
prices = boston_housing_data.medv
nox = boston_housing_data.nox

# Create a scatterplot of these two properties using plt.scatter()
plt.scatter(nox, prices)
# Specify labels for the X and Y axis
plt.xlabel('Nitric oxide concentration')
plt.ylabel('House price')
# Show graph
plt.show()

Predicting house prices

We could see in the previous graphs that some features have a roughy linear relationship to the house prices. We'll use Scikit-Learn's LinearRegression to model this data and predict house prices from other information.

The example below builds a LinearRegression model using the average number of rooms to predict house prices:



In [10]:

    
# First extract the predictors(the feature(s) that will be used to 
# predict the house prices) and the outcome(the house prices) into 
# different variables.

x = rooms.values.reshape(-1, 1) # Extract the values from the 'rm' column here
y = prices.values.reshape(-1, 1) # Extract the values from the 'medv' column here

print('x: {}'.format(x[0:3, :]))
print('y: {}'.format(y[0:3]))









    



x: [[ 6.575]
 [ 6.421]
 [ 7.185]]
y: [[ 24. ]
 [ 21.6]
 [ 34.7]]

The values.reshape(-1, 1) method call is necessary in this case because scikit-learn expects the predictors to be in matrix form - i.e. it must be a bi-dimensional array. Since we're using a single predictor, pandas returns it as a one-dimensional array, so we have to reshape it to a "single-column matrix". This step is not necessary when we use more that one predictor to fit a scikit-learn model, as will be seen in the next example.

Now that we have the dataset isolated into predictor and outcome variables, they must be split into two different sets: a training set and a test set. This step is necessary if you want to be able to estimate how well the trained model will behave when it's used to predict prices of new houses: you must first use the training set to train the model and then calculate its error on the test set.



In [11]:

    
# Use sklean's train_test_plit() method to split our data into two sets.
# See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

RANDOM_STATE = 4321
xtr, xts, ytr, yts = train_test_split(x, y, random_state=RANDOM_STATE) # call train_test_split here

If we try to estimate the model performance in the same data that was used to train it, we might get a biased evaluation since the model was trained to minimize its error in this data set. In order to estimate how well the model will behave in practice, it must be evaluated in a separate data set.



In [12]:

    
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(xtr, ytr) # fit a LinearRegression model into the training data.

lr.predict(6)









    Out[12]:





array([[ 19.81986275]])



In [13]:

    
# Calculate prices predicted by the trained model
predicted_prices = lr.predict(x)

# Create a scatterplot of these two properties using plt.scatter()
plt.scatter(rooms, prices)
# Create a line plot showing the predicted values in red
plt.plot(rooms, predicted_prices, 'r')
# Specify labels for the X and Y axis
plt.xlabel('Number of rooms')
plt.ylabel('House price')
# Show graph
plt.show()

Now we can use Scikit-Learn's mean_squared_error function to calculate the model's error in the test data set.



In [14]:

    
# Use the test set to assess the model's performance.
from sklearn.metrics import mean_squared_error

# Calculate the model's mean_squared_error here
mean_squared_error(yts, lr.predict(xts))









    Out[14]:





50.248441237303979

We'll now use all the features in the dataset to predict house prices and see how it improves the model's performance.



In [15]:

    
X = boston_housing_data.drop('medv', axis=1) # Use the drop() method to drop the 'medv' column and keep the others.
y = boston_housing_data.medv # Extract the house values from the 'medv' column.

X.head()

The drop() method will, by default, act on rows instead of columns. In order to drop columns, we have to pass an additional argument 'axis=1' to inform that we're dropping columns instead of rows.



In [16]:

    
# Use sklean's train_test_plit() method to split our data into two sets.
# See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

ANOTHER_RANDOM_STATE=1234
Xtr, Xts, ytr, yts = train_test_split(X, y, random_state=ANOTHER_RANDOM_STATE)

# Use the training set to build a LinearRegression model
lr = LinearRegression().fit(Xtr, ytr)

# Calculate the model's mean_squared_error in the test set
mean_squared_error(yts, lr.predict(Xts))









    Out[16]:





27.445922985339557

What kind of enhancements could be done to get better results?

One final observation about splitting datasets

Data is usually split into three different sets:

Training set: this is the data used to actually train the machine learning model;
Validation set (not shown in this workshop): This set is used to select the best machine learning model among different algorithms or hyperparameters. The idea is that you'd not use this data set to directly train your model but you'd use it to choose the right algorithm, architecture, hyperparameters and so on during the training phase;
Test set: this is a set that is used to asess the model performance after the best model is trained and selected using the training and validation sets. This data set is ideally used only once to evaluate how the model would perform on completely unseen data.

You can refer to this post for more information. There are other popular approaches like Cross-Validation for model tuning and evaluation, but they won't be covered during this workshop.

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	5.33	36.2

	crim	zn	indus	chas	nox	rm	age	dis	rad	tax	ptratio	lstat	medv
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	37.970000	50.000000

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	lstat
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	5.33